The speech that US President Donald Trump gave to the United Nations General Assembly on Sept. 19, 2017 was fascinating for several reasons. For me personally, it was interesting because it included surprisingly complex sentences and statements, uncharacteristic of Trump’s previous talks. It was also intriguing because the speech was quite undiplomatic and fierce, to the point of steering the world toward the brink of thermonuclear war. At least North Korea is genuinely pissed.

As can be expected when a president addresses the UN, several countries were mentioned in the speech, some in a favorable and others in a negative context. I wanted to run an analysis that connects the sentiments of the speech with the countries mentioned, to see how these countries are regarded by the US.

library(magrittr)
library(tidyverse)
library(rvest)
library(tidytext)
library(forcats)
library(wordcloud)
library(wordcloud2)
library(rworldmap)
library(stringr)
library(ggrepel)
# Get speech excerpt
url <- "http://www.politico.com/story/2017/09/19/trump-un-speech-2017-full-text-transcript-242879"
speech_excerpt <- 
        read_html(url) %>% # Download the whole article page
        html_nodes("style~ p+ p , .lazy-load-slot+ p , .fixed-story-third-paragraph+ p , .story-related+ p , p~ p+ p") %>% # Select the required elements (by css selector)
        html_text() %>% # Make it text
        .[-2] %>% # Remove some random homepage text
        gsub("Mr.", "Mr", ., fixed = T) %>% # Make sure that dots in the text will not signify sentences
        gsub("Latin America","latin_america",.) %>% # Not to confuse with USA
        gsub("United States of America", "usa", .) %>% # USA has to be preserved as one expression
        gsub("United States", "usa", .) %>% 
        gsub("America", "usa", .) %>% 
        gsub("Britain", "uk", .) %>% # UK is mentioned 
        gsub("North Korea", "north_korea", .) %>% # North Korea should be preserved as one word for now
        tibble(paragraph = .)

# Tokenize by sentence                        
speech_sentences <- 
        speech_excerpt %>% 
        unnest_tokens(sentence, paragraph, token = "sentences")
        
# Tokenize by word
speech_words <- 
        speech_excerpt %>% 
        unnest_tokens(word, paragraph, token = "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        # Here comes a nasty manual stemming of country names. Sadly, I failed to get satisfactory results on country names with standard stemmers (I tried SnowballC, hunspell, and textstem). I also tried to create a custom dictionary with added country names, to no avail. What am I missing? Anyway, this works.
        mutate(word = word %>% 
                       str_replace_all("'s$","") %>% # Cut 's
                       if_else(. == "iranian", "iran", .) %>% 
                       if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$",""),.) %>% 
                       if_else(. %in% c("usan","syrian","african","cuban","venezuelan"), str_replace(., "n$",""),.)
        )
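To make the normalization above easier to inspect, here is the same logic wrapped in a small function and run on a handful of toy tokens. This is just a sketch; `normalize_country` is a hypothetical helper name, not part of the pipeline above.

normalize_country <- function(word) {
        # Hypothetical helper wrapping the same manual stemming rules as above
        word %>% 
                str_replace_all("'s$", "") %>% # Cut 's
                if_else(. == "iranian", "iran", .) %>% 
                if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$", ""), .) %>% 
                if_else(. %in% c("usan", "syrian", "african", "cuban", "venezuelan"), str_replace(., "n$", ""), .)
}

normalize_country(c("iranian", "usans", "cuban", "iran's", "peace"))
# Expected: "iran" "usa" "cuba" "iran" "peace"

Note how the order of the rules matters: "usans" is trimmed to "usa" by the plural rule before the demonym rule runs, so it is never double-stripped.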

Exploring the text

The following word cloud shows the words mentioned in the speech. Larger words were used more frequently than smaller ones.

speech_words %>% 
        anti_join(stop_words, by = "word") %>% 
        count(word, sort = TRUE) %>% 
        wordcloud2()
        # wordcloud2(figPath = "trump.png") # Wanted to make a word cloud in the shape of Trump's head, but the package has a known bug that prevented me from doing so.

Next, I looked at the most frequent emotional words - those used at least three times - in the speech. It turns out that the majority of frequent emotional words had a positive connotation (e.g. prosperity, support, strong). Among the negative words, the most frequent were related to conflict (conflict, confront, etc.).

# Check emotional words that were uttered at least 3 times
speech_words %>% 
        count(word) %>% 
        inner_join(get_sentiments("bing"), by = "word") %>% 
        filter(n >= 3) %>% 
        mutate(n = if_else(sentiment == "negative", -n, n)) %>% 
        ggplot() +
                aes(y = n, x = fct_reorder(word, n), fill = sentiment) +
                geom_col() +
                coord_flip() +
                labs(x = "word", 
                     y = "Occurrence in speech",
                     title = "Most common words in Trump's 17/09/19 UN speech by sentiment")

To show the less frequent emotional words as well, the next word cloud displays all emotional words by sentiment.

speech_words %>%
        inner_join(get_sentiments("bing"), by = "word") %>% 
        count(word, sentiment, sort = TRUE) %>%
        spread(sentiment, n, fill = 0L) %>%
        as.data.frame() %>% 
        remove_rownames() %>% 
        column_to_rownames("word") %>% 
        comparison.cloud(colors = c("red", "blue"))

Let’s look into specific emotions using the NRC sentiment dictionary, which associates individual words with distinct emotions. The next plot shows the frequency of each emotion in the talk. It seems that the emotion dominating the speech was trust, followed by fear and anticipation.

speech_words %>%
        inner_join(get_sentiments("nrc"), by = "word") %>% # Use distinct emotion dictionary
        filter(!sentiment %in% c("positive","negative")) %>% # Only look for distinct emotions
        group_by(sentiment) %>% 
        count(sentiment, sort = T) %>% 
        ggplot() +
                aes(x = fct_reorder(sentiment %>% str_to_title, -n), 
                    y = n, 
                    label = n) +
                geom_col() +
                geom_label(vjust = 1) +
                theme_minimal() +
                labs(title = "The occurrence of words linked to distinct emotions in the speech", 
                     x = "Emotion", 
                     y = "Frequency")
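Since the NRC lexicon links individual words to emotions, it can also be interesting to see which words drive each emotion. A quick sketch, reusing the `speech_words` object from above:

# Top three words behind each distinct emotion
speech_words %>%
        inner_join(get_sentiments("nrc"), by = "word") %>%
        filter(!sentiment %in% c("positive", "negative")) %>%
        count(sentiment, word, sort = TRUE) %>%
        group_by(sentiment) %>%
        top_n(3, n) %>%
        arrange(sentiment, desc(n))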

Let’s put the sentiments on a map

First, let’s prepare the data: load a world map database, count how many times each country is mentioned in the speech, and join the counts to the country coordinates.

# Load map database
map_world <- 
        map_data(map="world") %>% 
        mutate(region = region %>% str_to_lower()) # Make country name lower case to match word

# Calculate mentions of a country, and join geodata
trump_countries <-
        speech_words %>% 
        count(word) %>% 
        right_join(map_world, by = c("word" = "region")) %>% # Match country coordinates to speech
        select(region = word, everything())

# Get country names with the middle of the country coordinates
country_names <- 
        trump_countries %>% 
        drop_na(n) %>%
        group_by(region) %>% 
        summarise(lat = mean(lat),
                  long = mean(long))

Let’s see which countries are mentioned the most. Obviously, the USA! Iran, Venezuela, and North Korea are also mentioned several times. Apart from these, most countries come up only once or twice during the speech.

trump_countries %>% 
        ggplot() +
        aes(map_id = region, 
            x = long, 
            y = lat, 
            label = paste0(region %>% str_to_title(),": ", n)) +
        geom_map(aes(fill = log10(n)), 
                 map = trump_countries) +
        geom_label_repel(data = trump_countries %>% 
                                 drop_na(n) %>% 
                                 group_by(region) %>% 
                                 slice(1), 
                         alpha = .75) +
        scale_fill_gradient(low = "lightblue", 
                            high = "darkblue", 
                            na.value = "grey90") +
        labs(title = "Number of mentions by country", 
             x = "Longitude", 
             y = "Latitude") +
        theme_minimal() +
        theme(legend.position = "none")

Next, I wanted to see how the speech developed over time and what the sentiment of each sentence was. Moreover, I wanted to mark which countries were mentioned in particular parts of the talk.

# Sentiment of each sentence
sentence_sentiment <-
speech_sentences %>% 
        mutate(sentence_num = row_number(),
               sentence_length = str_length(sentence) # length() would return the number of sentences, not characters per sentence
        ) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        # The same manual country-name stemming as above (see the earlier comment).
        mutate(word = word %>% 
                       str_replace_all("'s$","") %>% # Cut 's
                       if_else(. == "iranian", "iran", .) %>% 
                       if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$",""),.) %>% 
                       if_else(. %in% c("usan","syrian","african","cuban","venezuelan"), str_replace(., "n$",""),.)
        ) %>% 
        left_join(get_sentiments("bing"), by = "word") %>%
        mutate(sentiment_score = case_when(sentiment == "positive" ~ 1,
                                           sentiment == "negative" ~ -1,
                                           is.na(sentiment) ~ NA_real_)) %>%
        group_by(sentence_num) %>%
        summarise(sum_sentiment = sum(sentiment_score, na.rm = T),
                  sentence = paste(word, collapse = " "))

# Which sentence has a country name
country_sentence <- 
        speech_sentences %>% 
        mutate(sentence_num = row_number()) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        right_join(country_names %>% select(region), by = c("word" = "region")) %>% 
        arrange(sentence_num)

# Sentiment for each country
country_sentiment <-         
        sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        select(region = word, sum_sentiment) %>% 
        drop_na() %>% 
        group_by(region) %>% 
        summarise(country_sentiment = sum(sum_sentiment, na.rm = T))

Checking how the speech sentiment develops over time, and what countries are mentioned

First, it is important to note that the analysis is based on the summed sentiment of each sentence, which can be misleading in isolation. For example, in the middle of the speech, Israel and the US are mentioned in a very negative sentence; however, the negative tone served to condemn Iran in the next sentence. To mitigate this error, I calculated a rolling mean over the sentence sentiments, so each sentence now carries the “spillover” sentiment from the previous and following sentences.
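The smoothing itself is just `zoo::rollmean()` with a window of three sentences. A toy example shows how a lone very negative score gets averaged with its neighbors:

# A single negative score (-3) surrounded by positive ones is softened by the rolling mean
zoo::rollmean(c(2, -3, 1, 0, 4), k = 3, fill = 0, align = "center")
# Approximately: 0, 0, -0.667, 1.667, 0 (the ends are padded with the fill value)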

sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        mutate(roll_sentiment = zoo::rollmean(sum_sentiment, 3, fill = 0, align = "center")) %>% # Calculate a rolling mean with a window of 3
        mutate(sentiment_type = case_when(roll_sentiment > .5 ~ "positive",
                                          roll_sentiment < -.5 ~ "negative",
                                          TRUE ~ "neutral") %>% # Label sentence sentiments based on the rolling mean; TRUE catches the boundary cases of exactly +/- .5
                       fct_rev()
        ) %>% 
        ggplot() +
                aes(x = sentence_num, 
                    y = roll_sentiment, 
                    label = word %>% str_to_title()) +
                geom_hline(yintercept = 0, 
                           color = "grey", 
                           linetype = "dashed", 
                           size = 1.2) +
                geom_line(size = 1.2, 
                            color = "black") +
                geom_label_repel(aes(fill = sentiment_type), 
                                 alpha = .8, 
                                 segment.alpha = 0.5) +
                scale_fill_manual(values = c("green","grey","red")) +
                theme_minimal() +
                labs(x = "Sentence number", 
                     y = "Sentence sentiment", 
                     title = "The summarised sentiment of sentences, and the appearance of country names in the speech \nby sentiment in sentence order",
                     subtitle = "The dashed line signifies neutral sentence sentiment. \nCountry label colors show the direction of the sentiment (positive/negative)") 

As we can see, Trump started the speech with positive statements, mostly praising the USA. Then came his blacklist, with North Korea, China, Ukraine, Russia, and Israel mentioned in a negative context. The speech then took a positive turn, and several Middle Eastern and African countries were referenced in a generally favorable context.
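To double-check the reading for any single country, we can pull out the sentences in which it appears together with their summed sentiment. A sketch, reusing the `sentence_sentiment` and `country_sentence` objects built above:

# Sentences that mention North Korea, with their summed sentiment scores
sentence_sentiment %>%
        inner_join(country_sentence %>% filter(word == "north korea"),
                   by = "sentence_num") %>%
        select(sentence_num, sum_sentiment, sentence)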

Latin American countries - such as Cuba, and especially Venezuela - were also scolded toward the end of the speech.

So, how about summarizing the country sentiments throughout the whole text and plotting them on a map, to see which countries were regarded positively or negatively overall?

sentiment_map_data <- 
        trump_countries %>% 
        left_join(country_sentiment, by = "region")

sentiment_map_data %>% 
        mutate(country_sentiment = if_else(region == "usa", NA_real_, country_sentiment)) %>% # Exclude US
        ggplot() +
                aes(map_id = region, 
                    x = long, 
                    y = lat, 
                    label = paste0(region %>% str_to_title(), ": ", country_sentiment)) +
                geom_map(aes(fill = country_sentiment), 
                         map = trump_countries) +
                scale_fill_gradient(high = "green", 
                                    low = "red", 
                                    na.value = "grey90") +
                geom_label_repel(data = sentiment_map_data %>%
                                         drop_na(n) %>%
                                         group_by(region) %>%
                                         slice(1),
                                 alpha = .5) +
                theme_minimal() +
                labs(title = "Sentiment of the sentences where countries were mentioned (USA excluded)", 
                     x = "Longitude", 
                     y = "Latitude")